selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch by hanbitmyths · Pull Request #2473 · microsoft/Olive

hanbitmyths · 2026-05-22T23:45:08Z

This PR hardens SelectiveMixedPrecision (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

QKV-aware quant config overrides (olive/passes/pytorch/quant_utils.py): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.
AUTO kld_memory_mode (olive/passes/pytorch/selective_mixed_precision.py): A new auto setting selects among full, multi_gpu, low_memory, and offload based on visible GPU memory and estimated model footprint, and logs the decision (e.g. KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)).
New multi_gpu mode: Uses accelerate.dispatch_model + infer_auto_device_map with _no_split_modules honored. After infer_auto_device_map, every model.layers.N.* entry is coalesced to the first device assigned for that layer, and a defensive check falls back to low_memory if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.
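
The coalescing step and the defensive check in the multi_gpu item can be sketched in plain Python. This is a minimal illustration, not the code in selective_mixed_precision.py: the helper names (`coalesce_layer_devices`, `layers_spanning_devices`, `layers_per_device`) and the sample device map are hypothetical, and the input is assumed to be a module-name-to-device dict of the kind `infer_auto_device_map` returns.

```python
import re
from collections import Counter

# Matches "model.layers.N" and anything nested under it.
LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            layer = match.group(1)
            first_device.setdefault(layer, device)
            coalesced[name] = first_device[layer]
        else:
            # Embeddings, final norm, lm_head, etc. keep their assignment.
            coalesced[name] = device
    return coalesced


def layers_spanning_devices(device_map):
    """Return decoder layers whose entries sit on more than one device."""
    devices = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            devices.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devs in devices.items() if len(devs) > 1)


def layers_per_device(device_map):
    """Count decoder layers per device (the kind of figure a diagnostic log could report)."""
    seen = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            seen.setdefault(match.group(1), device)
    return Counter(seen.values())


# Hypothetical map in which layer 1 was split across GPU 0 and GPU 1.
raw_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "model.norm": 1,
    "lm_head": 1,
}
coalesced_map = coalesce_layer_devices(raw_map)
still_split = layers_spanning_devices(coalesced_map)
```

In this sketch an empty `still_split` means dispatch can proceed; a non-empty list would correspond to the low_memory fallback described above.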

Validation (A100 VM)

Qwen3-0.6B old vs new export: tokens identical (124 vs 116 overrides, new_missing_qkv_partners=[]), same 657 MB output, ~301 vs 309 tok/s.
Qwen2.5-1.5B-Instruct export + ort-genai: 1.34 GB int4, 290 tok/s.
Qwen2.5-14B-Instruct AUTO → MULTI_GPU (3×A100), 9.44 GB int4, 95 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)

Model	N	PyTorch	ort-genai	Δ
Qwen3-0.6B	500	36.6%	28.6%	−8.0 pp
Qwen2.5-1.5B-Instruct	500	60.2%	54.2%	−6.0 pp
Qwen2.5-14B-Instruct	250	74.8%	77.2%	+2.4 pp (within ±5.5 pp CI)

The 14B delta lies within its ±5.5 pp confidence interval, so it shows no measurable accuracy loss; the small-model deltas are inherent to int4 SMP on sub-2B-parameter models, not regressions introduced here.
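
For readers who want to sanity-check the interval, one common estimate is the normal-approximation (Wald) interval for a single accuracy. The PR does not state how its ±5.5 pp figure was computed, so the sketch below is an assumption about the method, and the function name `ci_half_width_pp` is hypothetical. It treats each accuracy on its own and is not a paired test of the fp16 vs int4 difference.

```python
import math


def ci_half_width_pp(accuracy, n, z=1.96):
    """Half-width, in percentage points, of a normal-approximation CI for an accuracy."""
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies and sample sizes taken from the ort-genai column of the table above.
half_width_14b = ci_half_width_pp(0.772, 250)
half_width_0_6b = ci_half_width_pp(0.286, 500)

print(f"14B:  +/-{half_width_14b:.1f} pp at N=250")
print(f"0.6B: +/-{half_width_0_6b:.1f} pp at N=500")
```

Under this approximation the 14B half-width comes out at roughly 5 pp, the same order as the quoted ±5.5 pp, and the +2.4 pp delta sits inside it. The −8.0 pp delta for Qwen3-0.6B is larger than the corresponding half-width at N=500.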

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass. (24 passed, 1 skipped in test_selective_mixed_precision.py)
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: SelectiveMixedPrecision now supports an auto setting for kld_memory_mode and a new multi_gpu mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
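
As a rough picture of what the auto setting decides, the sketch below picks a mode from an estimated footprint and the memory of the visible GPUs. It is an illustration under stated assumptions, not the heuristic in selective_mixed_precision.py: `MemoryEstimate`, `select_kld_memory_mode`, the 0.9 headroom factor, the `low_memory_gb` estimate, and the ordering of the checks are all hypothetical. Only the four mode names come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEstimate:
    full_gb: float        # estimated footprint of scoring in full mode
    low_memory_gb: float  # hypothetical: estimated footprint in low_memory mode
    per_gpu_gb: list      # memory of each visible GPU, in GB


def select_kld_memory_mode(est, headroom=0.9):
    """Pick a KLD memory mode; the thresholds here are illustrative assumptions."""
    if not est.per_gpu_gb:
        return "offload"
    single_budget = max(est.per_gpu_gb) * headroom
    multi_budget = sum(est.per_gpu_gb) * headroom
    if est.full_gb <= single_budget:
        return "full"
    if len(est.per_gpu_gb) > 1 and est.full_gb <= multi_budget:
        return "multi_gpu"
    if est.low_memory_gb <= single_budget:
        return "low_memory"
    return "offload"


# Numbers loosely modeled on the 3-GPU log line quoted in the description.
three_gpus = MemoryEstimate(full_gb=145.14, low_memory_gb=40.0, per_gpu_gb=[80.0, 80.0, 80.0])
one_gpu = MemoryEstimate(full_gb=145.14, low_memory_gb=40.0, per_gpu_gb=[80.0])

mode_three = select_kld_memory_mode(three_gpus)
mode_one = select_kld_memory_mode(one_gpu)
```

With these inputs the three-GPU case resolves to multi_gpu, matching the outcome reported for Qwen2.5-14B-Instruct, while the same estimate on a single GPU falls through to low_memory.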

…TI_GPU dispatch

- Normalize per-layer quant config overrides so Q/K/V projections in the same attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU, LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with _no_split_modules honored, plus a coalescing pass that pins every model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and the MULTI_GPU device-map coalescing path.

Copilot

Pull request overview

This PR strengthens the SelectiveMixedPrecision (SMP) PyTorch pass for LLMs targeting ONNX Runtime GenAI by (a) enforcing Q/K/V consistency in both scored selection and quantization overrides, and (b) adding auto memory-mode selection and a new multi_gpu mode for KLD-gradient scoring, which makes scoring practical on large models.

Changes:

Add Q/K/V-aware grouping so scored selection promotes attention input projections together, and normalize quantization overrides so Q/K/V share the most-precise config.
Introduce kld_memory_mode with auto resolution plus a new multi_gpu mode using Accelerate dispatch and device-map coalescing/validation.
Expand unit tests to cover QKV grouping/normalization, KLD scoring equivalence across memory modes, and AUTO/MULTI_GPU selection behavior.
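
The override normalization in the first item can be sketched as follows. This is a simplified illustration rather than the code in quant_utils.py: `normalize_qkv_overrides` is a hypothetical name, "most precise" is assumed to mean the highest `bits` value, and modules are assumed to follow the Hugging Face `self_attn.q_proj` / `k_proj` / `v_proj` naming. It does not cover fused projections or the excluded-input handling mentioned in the file table.

```python
import re

# Captures the attention block prefix of a Q, K, or V projection.
QKV_RE = re.compile(r"^(.*\.self_attn)\.(q_proj|k_proj|v_proj)$")


def normalize_qkv_overrides(overrides, module_names):
    """Give Q, K, and V in each attention block the most precise config any of them has.

    overrides: {module_name: {"bits": int, ...}}; module_names: all quantizable modules.
    """
    groups = {}
    for name in module_names:
        match = QKV_RE.match(name)
        if match:
            groups.setdefault(match.group(1), []).append(name)

    normalized = dict(overrides)
    for members in groups.values():
        overridden = [name for name in members if name in overrides]
        if not overridden:
            continue  # no projection in this block is overridden; leave defaults
        # Assumption: higher bit width means more precise.
        best = max((overrides[name] for name in overridden), key=lambda cfg: cfg["bits"])
        for name in members:
            normalized[name] = dict(best)
    return normalized


module_names = [
    f"model.layers.{i}.self_attn.{proj}"
    for i in range(2)
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj")
] + ["model.layers.1.mlp.down_proj"]

partial = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.1.mlp.down_proj": {"bits": 8},
}
normalized = normalize_qkv_overrides(partial, module_names)
```

Here a lone `q_proj` override in layer 0 is extended to `k_proj` and `v_proj`, while layer 1, which has no attention-input override, is left untouched.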

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`olive/passes/pytorch/selective_mixed_precision.py`	Adds QKV grouping in scored overrides and implements AUTO/FULL/MULTI_GPU/LOW_MEMORY/OFFLOAD KLD scoring paths with heuristics and Accelerate-based sharding.
`olive/passes/pytorch/quant_utils.py`	Adds QKV group discovery + override normalization to ensure attention input projections share a consistent quant config, including support for excluded attention inputs.
`test/passes/pytorch/test_selective_mixed_precision.py`	Adds extensive unit tests for QKV grouping/normalization and KLD scoring/memory-mode behavior, including MULTI_GPU dispatch stubbing.

hanbitmyths and others added 3 commits May 22, 2026 16:44

docs: surface KLD memory modes and QKV grouping in pass docstring

cc52a7e

Merge branch 'main' into smp-qkv-aware-multi-gpu

a4a2b2a

hanbitmyths marked this pull request as ready for review May 22, 2026 23:51

Copilot AI review requested due to automatic review settings May 22, 2026 23:51

Copilot started reviewing on behalf of hanbitmyths May 22, 2026 23:51

Copilot AI reviewed May 22, 2026

Comment thread olive/passes/pytorch/quant_utils.py

Comment thread olive/passes/pytorch/selective_mixed_precision.py Outdated

Address SMP review feedback

8e98a92

hanbitmyths wants to merge 4 commits into microsoft:main from hanbitmyths:smp-qkv-aware-multi-gpu
